Data Mining
With SSAS, a much more robust selection of data mining capabilities is available. Data mining
is the process of uncovering previously unknown
characteristics or distributions of data. Data mining can be extremely
useful in OLAP database design because the patterns or values it reveals might define
hierarchy levels or dimensions that were not previously known. As you create dimensions, you can even choose a data mining model as the basis for a dimension.
Basically, a data
mining model is a reference structure that represents the grouping and
predictive analysis of relational or multidimensional data. It is
composed of the rules, patterns, and other statistical information derived from the
data it analyzes. The individual units of data being analyzed are called cases. A case set
is simply a means of viewing the physical data. Different case sets
can be constructed from the same physical data; in other words, a case set defines the cases
from a particular point of view. If the algorithm you are using
supports it, you can use mining models to make predictions based
on these findings.
Another aspect of a data mining
model is its use of training data. Training determines the relative
importance of each attribute in a data mining model. It does this by
recursively partitioning data into smaller groups until no more
splitting can occur. During this partitioning process, information is
gathered from the attributes used to determine each split, and a probability
can be established for each resulting categorization of the data.
These probabilities can then be used to draw conclusions about other
data. The training data, in the form of
dimensions, levels, member properties, and measures, is used to process
the OLAP data mining model and to further define the data mining column
structure for the case set.
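The recursive-partitioning idea described above can be sketched in a few lines: split the cases on attribute values until no further split is possible, and record a probability for each resulting group. This is a hypothetical, minimal illustration of the concept only, not SSAS code; the attribute and case values are invented for the example.

```python
# Toy sketch of recursive partitioning with per-group probabilities.
# SSAS performs this kind of analysis internally during model training;
# all names and data here are hypothetical.
from collections import Counter

def partition(cases, attributes, target):
    """Recursively split cases and report P(target value) per group."""
    counts = Counter(c[target] for c in cases)
    total = sum(counts.values())
    # Stop when the group is pure or no attributes remain to split on.
    if len(counts) == 1 or not attributes:
        return {value: n / total for value, n in counts.items()}
    attr, rest = attributes[0], attributes[1:]
    return {
        f"{attr}={value}": partition(
            [c for c in cases if c[attr] == value], rest, target)
        for value in sorted({c[attr] for c in cases})
    }

cases = [  # hypothetical training cases
    {"region": "East", "season": "Summer", "sold": "yes"},
    {"region": "East", "season": "Winter", "sold": "no"},
    {"region": "West", "season": "Summer", "sold": "yes"},
    {"region": "West", "season": "Winter", "sold": "yes"},
]
print(partition(cases, ["region"], "sold"))
# {'region=East': {'yes': 0.5, 'no': 0.5}, 'region=West': {'yes': 1.0}}
```

Each leaf of the result carries the probabilities that the real training process would use to make predictions about new data.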
In SSAS, Microsoft provides several data mining algorithms (or techniques):
Association Rules—
This algorithm builds rules that describe which items are most likely
to appear together in a transaction. The rules can be used to predict the
presence of one item based on the presence of other items that have appeared with it in
the same type of transaction before.
Clustering—
This algorithm uses iterative techniques to group records from a
dataset into clusters that share similar characteristics. It is one
of the most broadly useful algorithms because it can find general groupings in
data.
Sequence Clustering— This
algorithm is a combination of sequence analysis and clustering, and it
identifies clusters of similarly ordered events in a sequence. The
clusters can be used to predict the likely ordering of events in a
sequence, based on known characteristics.
Decision Trees—
This classification algorithm works well for predictive modeling. It
supports the prediction of both discrete and continuous attributes.
Linear Regression—
This regression algorithm works well for regression modeling. It is a
configuration variation of the Decision Trees algorithm, obtained by
disabling splits. (The whole regression formula is built in a single
root node.) The algorithm supports the prediction of continuous
attributes.
Logistic Regression—
This regression algorithm works well for regression modeling. It is a
configuration variation of the Neural Network algorithm, obtained by
eliminating the hidden layer. This algorithm supports the prediction of
both discrete and continuous attributes.
Naïve Bayes—
This classification algorithm is quick to build, and it works well for
predictive modeling. It supports only discrete attributes, and it
considers all the input attributes to be independent, given the
predictable attribute.
Neural Network—
This algorithm uses a gradient method to optimize parameters of
multilayer networks to predict multiple attributes. It can be used for
classification of discrete attributes as well as regression of
continuous attributes.
Time Series—
This algorithm uses a linear regression decision tree approach to
analyze time-related data, such as monthly sales data or yearly
profits. The patterns it discovers can be used to predict values for
future time steps across a time horizon.
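As a rough illustration of the clustering technique described above, here is a minimal K-means sketch: assign each record to its nearest centroid, move each centroid to the mean of its group, and repeat until stable. This is only a conceptual toy, not what SSAS runs internally (Microsoft Clustering offers both EM and K-means methods), and the SKU sales figures are invented for the example.

```python
# Minimal K-means sketch of the clustering idea. Hypothetical data;
# illustrates the general technique only, not the SSAS implementation.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: (p - centroids[i]) ** 2)
            groups[nearest].append(p)
        # Move each centroid to the mean of its assigned points.
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical monthly unit sales for several SKUs: two natural "hot spots".
sales = [10, 12, 11, 13, 95, 101, 98, 104]
centroids, groups = kmeans(sales, centroids=[0.0, 50.0])
print(centroids)  # [11.5, 99.5]
print(groups)     # [[10, 12, 11, 13], [95, 101, 98, 104]]
```

The two recovered centroids correspond to the two natural groupings in the data, which is exactly the kind of "hot spot" the Clustering algorithm surfaces in a cube.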
To create an OLAP data
mining model, SSAS needs an existing source OLAP cube (or an
existing relational database/data warehouse), a particular data mining
technique/algorithm, a case dimension and level, a predicted entity, and,
optionally, training data. The source OLAP cube provides the
information needed to create a case set for the data mining model. You
then select the data mining technique (Decision Trees, Clustering, or
one of the others). SSAS uses the dimension and level that you choose to
establish key columns for the case sets. The case dimension and level
give the data mining model a particular orientation into the cube
for creating a case set. The predicted entity can be a measure
from the source OLAP cube, a member property of the case dimension and
level, or any member of another dimension in the source OLAP cube.
Note
The Data Mining Wizard can also
create a new dimension for a source cube, and it enables users to query the
data mining model data just as they would query OLAP data (by using
the DMX extensions to SQL or the mining structure browsers).
In Visual Studio, you simply initiate the Data Mining Wizard by right-clicking the Mining Structures
entry in the Solution Explorer. You cannot create new mining structures
from SSMS. When you are past the wizard’s splash screen, you have the
option of creating your mining model from either an existing relational
database (or data warehouse) or an existing OLAP cube (as shown in Figure 53).
You want to define a data
mining model that can shed light on product (SKU) sales characteristics
and that will be based on the data and structure you have created so
far in your Comp Sales Unleashed cube. For this example, you choose
the existing cube method and use the OLAP cube you already have.
You must now select the data
mining technique you think will help you find value in your cube’s
data. Clustering is probably the best one to start with because it
finds natural groupings of data in a multidimensional space. It is
useful when you want to see general groupings in your data, such as hot
spots. You are trying to find just such things in product sales
(for example, things that sell together or belong together). Figure 54 shows the data mining technique Microsoft Clustering being selected.
Now you have to identify the source cube dimension to use to build the mining structure. As you can see in Figure 55, you choose Product Dimension to fit the mining intentions stated earlier.
You then select the case key or point of view for the mining analysis. Figure 56 illustrates the case to be based on the product dimension and at the SKU level (that is, the individual product level).
You now specify the attributes and measures as case-level columns of the new mining structure. Figure 57
shows the possible selections. You can simply choose all the data
measures for this mining structure. Then you click the Next button.
As you can see in Figure 58,
the next few wizard dialogs allow you to specify the mining structure
columns’ content and data types (use the defaults that were detected
for most items unless we specifically describe something different),
identify a filtered slice to use for the model training (you don’t need
one now because you want the whole cube), and finally identify
the number of cases to be reserved for model testing (set the percentage
of data for testing to about 33%).
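The "percentage of data for testing" step amounts to a simple holdout split: roughly a third of the cases are set aside for model testing, and the rest are used for training. A minimal sketch of the idea, with hypothetical case names:

```python
# Holdout-split sketch: reserve about a third of the cases for testing.
# Case names and the seed are hypothetical; SSAS handles this internally.
import random

def holdout_split(cases, test_fraction=0.33, seed=53):
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)     # reproducible shuffle
    cut = int(len(shuffled) * test_fraction)  # size of the test set
    return shuffled[cut:], shuffled[:cut]     # (training, testing)

cases = [f"SKU-{n:03d}" for n in range(100)]
train, test = holdout_split(cases)
print(len(train), len(test))  # 67 33
```

Holding the test cases out of training is what lets you later measure how well the mining model predicts data it has never seen.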
The mining model is now specified and must be named and processed. Figure 59 shows what you have named the mining structure (Product Dimension MS) and the mining model name itself (Product Dimension MM).
Also, you select the Allow Drill Through option so you can look further
into the data in the mining model after it is processed. Then you click
the Finish button.
When the Data Mining
Wizard completes, the mining structure viewer pops up, showing your
mining structure’s case-level column specifications (on the center left)
and their correlation to your cube (see Figure 60).
You must now process the
mining structure to see what you come up with. You do this by selecting
the Mining Model toolbar option and then the Process option. The
usual Process dialog appears, and you choose to run it
(processing the mining structure). After the mining structure processing
completes, a quick click on the Cluster Diagram tab shows the results
of the clustering analysis (see Figure 61).
Notice that because you chose to allow drill through, you can simply
right-click any of the identified clusters and choose Drill Through to see
the data that is part of the cluster. This viewer clearly
shows that there is some clustering of SKU values that might indicate
products that sell together or belong together.
If you click the Cluster Profiles tab of this viewer, you see the data value profile characteristics that were processed (see Figure 62).
Figure 63
shows the clusters of data values of each data measure in the data
mining model. This characteristic information gives you a good idea of
what the actual data values are and how they cluster together.
Finally, you can see the
cluster node contents at the detail level by changing the mining model
viewer type to Microsoft Generic Content Tree Viewer, just
below the Mining Model Viewer tab at the top. Figure 64 shows the detail contents of each model node and its technical specification in a report format.
If you want, you can now
build new cube dimensions that can help you do predictive modeling
based on the findings of the data mining structures you just processed.
In this way, you could predict sales units of one SKU and the number of
naturally clustered SKUs quite easily (based on the past data mining
analysis). This type of predictive modeling is very powerful.